ABSTRACT

"Linear threshold element" is the generic term for a device which
forms the sum $a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_d x_d$ from an input vector
$(x_1, x_2, \ldots, x_d)$ and yields one of two outputs depending on whether or
not the sum is positive. A pattern classification machine may utilize a linear
threshold element along with a controller which receives the one of the
two values corresponding to correct classification of the input vector.
The purpose of the controller is to modify the gain vector
$(a_0, a_1, \ldots, a_d)$ so that the next input vector has a greater likelihood
of being correctly classified by the threshold element.

This likelihood depends on the value of the gain vector, and an
adaptive algorithm of the "steepest descent" variety can be used to
attempt to adjust the gain vector to its optimal value as the machine is
exposed to a stationary sequence of statistically independent input
vectors. The components of these vectors are commonly two valued, and it
has been shown that convergence of the expected value of the gain vector
is dependent on the value of the adjustment parameter, the values of the
components, and the distribution of the input vectors. It is shown herein
that a bound on the adjustment parameter, simply related to the values of
the input components, is sufficient to ensure this convergence. The
variance of the gain vector is derived under the assumptions of a uniform
input sequence and oppositely signed components of equal magnitude, and
it is shown that a similar bound on the adjustment parameter implies
convergence of the variance. The variance is graphed under representative
conditions.
1. Introduction.

"Linear threshold element" (LTE) is the generic term for a device
which forms the sum $a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_d x_d$ from an input
vector $(x_1, x_2, \ldots, x_d)$ and yields one of two outputs depending on
whether or not the sum is positive. The components of the input vectors are
commonly two valued, and therefore the total number of possible input
vectors is $2^d$. When used in a pattern classification machine (PCM) the
output of the LTE classifies each input into one of two classifications.
This classification may or may not be correct. The correct classification
is given by the environment, a fixed but unknown function defined
on the same set of input vectors, whose output is either of the two
values of the LTE.

As an example of an environment, consider a handwritten letter of
the alphabet which is "read" by an array of mark sensing devices. The
input to the environment is the pattern from the sensing devices and
the output is whether or not the letter which was read is a particular
letter. Consider also medical diagnosis. The electrocardiograph of a
particular human heart may be sensed as above, and the output of the
environment is whether or not a particular anomaly is present. In either
case, if the same sensing pattern is presented to an LTE, it will also
make a classification. The number of similar applications is large.

In addition to the LTE, a PCM may use a controller. This device
receives the environmental response to an input vector and attempts to
modify the gain vector $(a_0, a_1, a_2, \ldots, a_d)$ so that the next input
vector has a greater likelihood of being correctly classified by the
LTE. Several algorithms have been proposed to adjust the gain vector. [1]
The method under consideration is the steepest descent algorithm of
Widrow and Hoff. [2,3] It will be discussed in section 2.

The gain vector is given an initial setting $a(1)$, and then the
PCM is exposed to a potentially infinite sequence of input vectors
$\{X(n)\}$ and the corresponding sequence of correct output values from
the environment $\{f[X(n)]\}$, $n = 1, 2, 3, \ldots$. After each input vector is
presented, the gain vector is adjusted by the controller to yield the
sequence $\{a(n)\}$. Under certain conditions, this sequence will
converge to a terminal vector $a^*$ which is optimal in some sense for the
correct classification of an input vector by the LTE.

It is assumed that the sequence of input vectors is a strictly
stationary stochastic sequence of independent random variables, i.e.
$X(n) = X$, where $X$ is a random variable whose statistical properties are
completely described by the probability vector $P = (p_1, p_2, \ldots, p_{2^d})$,
and $p_j = \Pr\{X = X^j\} > 0$ for $j = 1, 2, 3, \ldots, 2^d$ and $\sum_j p_j = 1$. A
frequent example is the uniform input sequence: $p_j = 2^{-d}$ for all $j$, which
implies that the occurrence of each of the $2^d$ input vectors is equally
likely.

The LTE is used to predict the environment, and a measure of its
ability to perform this task is the state of the PCM, $S(a)$, defined as
the expected value of the squared difference between the responses of
the LTE and the environment with respect to the random variable $X$. The
state of the PCM is zero if and only if the responses of the LTE and
the environment are equal for all input vectors. The task of the
controller is to minimize $S$ with respect to the gain vector.

Due to the discontinuity of the step function, the minimization of
$S$ will not be without difficulty. Consider an auxiliary measure of the
performance of the PCM, $Q(a)$, the expected value of the squared
difference between the response of the environment and the sum
$a_0 + a_1 x_1 + \cdots + a_d x_d$ which is internally generated by the LTE.
This auxiliary measure is the one chosen for minimization, and it is
believed that the following theorem is correct, although a satisfactory
proof is not known to exist.

Theorem 1. If $Q(a^*) \le Q(a)$ for all gain vectors $a$,
then $S(a^*) \le S(a)$ for all gain vectors $a$.

To summarize the assumptions and notation, let the possible values
of the components of the input vectors be $q$ and $r$. Then the LTE, $g$, is
defined on the set

$$B^d = \{\, X^j : X^j = (x_0^j, x_1^j, \ldots, x_d^j)^T;\ x_0^j = \max(|q|, |r|);\ x_i^j \in \{q, r\},\ i = 1, 2, \ldots, d;\ j = 1, 2, \ldots, 2^d \,\}.$$

Let $A = \{\, a : a = (a_0, a_1, \ldots, a_d)^T \,\}$ be the set of gain vectors.
Then $g(X) = \operatorname{sgn}(a^T X)$, where

$$\operatorname{sgn}(t) = \begin{cases} 1, & \text{if } t > 0, \\ -1, & \text{otherwise,} \end{cases}$$

and $R = \{1, -1\}$ is the range set of $g$. The environment, $f$, also maps
$B^d$ into $R$. Note that each environment could be interpreted as one of
the $2^{2^d}$ Boolean functions of $d$ binary variables. The measures are as
follows, where the overbar denotes expected value.

$$S(a) = \overline{[f(X) - g(X)]^2} \quad \text{and} \quad Q(a) = \overline{[f(X) - a^T X]^2}.$$

$X$ is a random variable which takes on values in $B^d$, all with a positive
probability, in accordance with the probability vector $P$. Note that
all functions of $X$ are therefore random variables.
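The definitions above can be made concrete with a small sketch. It enumerates $B^d$ for $d = 2$, evaluates the LTE $g$, and computes $S(a)$ and $Q(a)$ by direct summation over the probability vector. The function names, the choice of logical AND as the environment $f$, and the particular gain vectors are illustrative assumptions, not part of the report.

```python
from itertools import product

def input_vectors(d, q, r):
    """All 2**d augmented vectors (x0, x1, ..., xd), with x0 = max(|q|, |r|)."""
    x0 = max(abs(q), abs(r))
    return [(x0,) + tail for tail in product((q, r), repeat=d)]

def sgn(t):
    # The report's convention: +1 only for strictly positive arguments.
    return 1 if t > 0 else -1

def lte(a, x):
    """Linear threshold element g(X) = sgn(a^T X)."""
    return sgn(sum(ai * xi for ai, xi in zip(a, x)))

def measures(a, f, vectors, p):
    """Return (S(a), Q(a)) as expectations over the probability vector p."""
    S = sum(pj * (f(x) - lte(a, x)) ** 2 for pj, x in zip(p, vectors))
    Q = sum(pj * (f(x) - sum(ai * xi for ai, xi in zip(a, x))) ** 2
            for pj, x in zip(p, vectors))
    return S, Q

def f_and(x):
    # Hypothetical environment: logical AND of the d inputs (x0 is ignored).
    return 1 if all(xi > 0 for xi in x[1:]) else -1

vectors = input_vectors(2, -1.0, 1.0)
p = [0.25] * 4                       # uniform input sequence
S_good, Q_good = measures((-1.0, 1.0, 1.0), f_and, vectors, p)
S_bad, Q_bad = measures((1.0, 0.0, 0.0), f_and, vectors, p)
S_star, Q_star = measures((-0.5, 0.5, 0.5), f_and, vectors, p)
```

In this example the gain vector $(-1, 1, 1)$ classifies every input correctly, so $S = 0$ even though $Q$ is not minimal there; the vector $(-0.5, 0.5, 0.5)$ gives a smaller $Q$ and also $S = 0$, which is in line with the relationship conjectured in Theorem 1.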


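2. The Adaptive Algorithm

The problem is to minimize

$$Q(a) = \overline{[f(X) - a^T X]^2} = \overline{f^2(X)} - 2\sum_{\ell=0}^{d} a_\ell\, \overline{f(X)\, x_\ell} + \sum_{\ell=0}^{d} \sum_{k=0}^{d} a_\ell a_k\, \overline{x_\ell x_k}.$$

Setting $\partial Q(a) / \partial a_i = 0$ for all $i$ yields

$$\sum_{\ell=0}^{d} a_\ell\, \overline{x_i x_\ell} = \overline{f(X)\, x_i}, \qquad i = 0, 1, \ldots, d. \tag{1}$$

That value of the gain vector which satisfies equation (1) is the
vector $a^*$, which minimizes $Q(a)$. But since $f$ is unknown, the equation
cannot be solved directly.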

Using the method of steepest descent, the gain vector is modified
after each presentation of an input vector:

$$a(n+1) = a(n) + \eta\, \operatorname{grad} Q_n,$$

where $a(n)$ is the value of the gain vector at the time of the $n$th
presentation;

$Q_n = \{ f[X(n)] - a^T(n) X(n) \}^2$;

$X(n)$ is the $n$th input vector;

$\operatorname{grad} Q_n = -\left( \dfrac{\partial Q_n}{\partial a_0}, \dfrac{\partial Q_n}{\partial a_1}, \ldots, \dfrac{\partial Q_n}{\partial a_d} \right)^T$;

$\eta$ is a positive constant.

Each component is adjusted in the direction of decreasing $Q_n$.
Specifically,

$$a_i(n+1) = a_i(n) + 2\eta\, x_i(n) \{ f[X(n)] - a^T(n) X(n) \}, \quad \forall i. \tag{2}$$
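A single application of equation (2) can be sketched as follows. The function name `widrow_hoff_step` and the sample values are assumptions chosen for illustration; only the update formula itself comes from the text.

```python
def widrow_hoff_step(a, x, target, eta):
    """One update of equation (2): a_i <- a_i + 2*eta*x_i*(f[X] - a^T X)."""
    error = target - sum(ai * xi for ai, xi in zip(a, x))
    return [ai + 2.0 * eta * xi * error for ai, xi in zip(a, x)]

def linear_sum(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

# Illustrative values: d = 2, components +/-1, environment response +1.
a_old = [0.0, 0.0, 0.0]
x = (1.0, -1.0, 1.0)
a_new = widrow_hoff_step(a_old, x, target=1, eta=0.1)

error_before = 1 - linear_sum(a_old, x)
error_after = 1 - linear_sum(a_new, x)
```

For this input the error in the internally generated sum falls from 1 to 0.4 after one step, which is the sense in which each component moves in the direction of decreasing $Q_n$.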


It is to be noted that $a(n)$ is a random variable. It remains to show
that $\overline{a(n)}$ converges to $a^*$. Martinez shows this as follows. [4]

Rewrite equation (2) as

$$a(n+1) = a(n) + \beta\, [\, b_n - C_n a(n) \,] \tag{3}$$

where the adjustment parameter $\beta = 2\eta$, $b_n = f[X(n)]\, X(n)$,
and $C_n = X(n) X^T(n)$. Note that $C_n = C_n^T$.

Hence

$$a(n+1) = \beta b_n + [\, I - \beta C_n \,]\, a(n) = d_n + E_n a(n), \tag{4}$$

where $d_n = \beta b_n$ and $E_n = I - \beta C_n$.
Expanding recursively and taking expected values,

$$\overline{a(n)} = \overline{\, d_{n-1} + E_{n-1} d_{n-2} + E_{n-1} E_{n-2} d_{n-3} + \cdots + E_{n-1} E_{n-2} \cdots E_1\, a(1) \,}. \tag{5}$$

Due to the assumptions of independence and stationarity on the
input sequence, equation (5) becomes

$$\begin{aligned}
\overline{a(n)} &= [\, I + E + E^2 + \cdots + E^{n-2} \,]\, D + E^{n-1} a(1), \qquad \text{where } E = \overline{E_n} \text{ and } D = \overline{d_n}, \\
&= [\, I - E \,]^{-1} [\, I - E^{n-1} \,]\, D + E^{n-1} a(1).
\end{aligned} \tag{6}$$

If $\lim_{n \to \infty} E^n = 0$, then

$$\lim_{n \to \infty} \overline{a(n)} = [\, I - E \,]^{-1} D.$$

It is then shown that if $C = \overline{C_n}$ is positive definite, and if
$\beta < 2/\lambda$, where $\lambda$ is the largest eigenvalue of $C$, then
$\lim_{n \to \infty} E^n = 0$. Hence, if $C$ is positive definite, and the
positive constant $\eta$ is less than the reciprocal of the largest
eigenvalue of $C$, then the modification given by equation (2) will cause
$\overline{a(n)}$ to converge to $a^*$, since $[\, I - E \,]^{-1} D = C^{-1} b$,
where $b = \overline{b_n}$, and if $a'$ is the terminal value of
$\overline{a(n)}$, then $b = C a'$, which is the minimizing equation (1).
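The convergence of the mean can be checked numerically for a small case. The sketch below builds $C = \overline{C_n}$ and $b = \overline{b_n}$ by enumeration, iterates the expected-value form of equation (3), $\overline{a(n+1)} = \overline{a(n)} + \beta\,[\, b - C\, \overline{a(n)} \,]$, and measures how far the result is from satisfying $b = Ca'$. The component values $q = 0$, $r = 1$, the non-uniform probability vector, the logical OR environment, and all function names are assumptions made for the example.

```python
from itertools import product

def moments(d, q, r, p, f):
    """Return C = E[X X^T] and b = E[f(X) X] by enumerating B^d."""
    x0 = max(abs(q), abs(r))
    vectors = [(x0,) + tail for tail in product((q, r), repeat=d)]
    n = d + 1
    C = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for pj, x in zip(p, vectors):
        fx = f(x)
        for i in range(n):
            b[i] += pj * fx * x[i]
            for k in range(n):
                C[i][k] += pj * x[i] * x[k]
    return C, b

def mean_gain(C, b, beta, a1, steps):
    """Iterate the expected gain vector: a <- a + beta * (b - C a)."""
    a = list(a1)
    for _ in range(steps):
        Ca = [sum(c * ak for c, ak in zip(row, a)) for row in C]
        a = [ai + beta * (bi - cai) for ai, bi, cai in zip(a, b, Ca)]
    return a

def f_or(x):
    # Hypothetical environment: logical OR of the inputs (x0 is ignored).
    return 1 if any(xi > 0 for xi in x[1:]) else -1

d, q, r = 2, 0.0, 1.0
p = [0.1, 0.2, 0.3, 0.4]                 # all probabilities positive
C, b = moments(d, q, r, p, f_or)
beta = 1.0 / ((d + 1) * r * r)           # inside the bound 2 / ((d+1) r^2)
a_bar = mean_gain(C, b, beta, [0.0] * (d + 1), 3000)

# Residual of the minimizing equation (1), b = C a'.
residual = max(abs(bi - sum(c * ak for c, ak in zip(row, a_bar)))
               for bi, row in zip(b, C))
```

The residual is driven to zero at a rate set by the smallest eigenvalue of $C$, which is why a few thousand iterations are used here.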

Under what conditions is $C$ positive definite? Consider $C^j = X^j X^{jT}$,
and let $Z$ be a non zero vector. Then

$$Z^T C^j Z = Z^T X^j X^{jT} Z = (Z^T X^j)^2 \ge 0,$$

which shows that $C^j$ is positive semi definite for all $j$. Now

$$C = \overline{C_n} = \sum_{j=1}^{2^d} p_j C^j,$$

and if $Z$ is as above, then

$$Z^T C Z = \sum_{j=1}^{2^d} p_j\, Z^T C^j Z \ge 0.$$

Therefore $C$ is always positive semi definite. That it is, in fact,
always positive definite is shown by contradiction. Assume there exists a
non zero vector $W$ such that $W^T C W = 0$.
Then $p_j W^T C^j W = 0$ for all $j$. Since $p_j > 0$ for all $j$,
$W^T C^j W = 0$ for all $j$, which implies that
$(W^T X^j)^2 = 0$ for all $j$, and

$$W^T X^j = 0 \quad \text{for all } j. \tag{7}$$

Equation (7) is a linear system of $2^d$ equations, and if the vector $W$ is
a solution, then it must satisfy a subsystem of $d+1$ of the equations.
Let $\{ X^{b_1}, X^{b_2}, \ldots, X^{b_{d+1}} \}$ be a set of linearly independent
vectors from $B^d$. Such a set exists since $B^d$ spans $(d+1)$-space. Hence

$$W^T \left( X^{b_1}\ X^{b_2}\ \cdots\ X^{b_{d+1}} \right) = 0. \tag{8}$$

Now a nontrivial solution to (8) exists if and only if

$$\left| X^{b_1}\ X^{b_2}\ \cdots\ X^{b_{d+1}} \right| = 0.$$

But this set of vectors is linearly independent and their determinant
is non zero. Hence for all non zero vectors $W$, $W^T C W \ne 0$, and $C$ is
always positive definite.

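What is the range of the eigenvalues of $C$? For each $C^j$,
$\operatorname{diag}(C^j) = \left( (x_0^j)^2, (x_1^j)^2, \ldots, (x_d^j)^2 \right)$,
and it follows that $\operatorname{diag}(C) = (e_0, e_1, \ldots, e_d)$, where

$$e_i = \sum_{j=1}^{2^d} p_j\, (x_i^j)^2 \quad \text{for all } i.$$

There is no loss of generality in assuming $|q| \le |r|$, and in that case
$e_0 = r^2$ and

$$q^2 \le e_i \le r^2 \quad \text{for } i = 1, 2, \ldots, d.$$

Hence

$$\operatorname{trace}(C) = \sum_{i=0}^{d} e_i \le (d+1)\, r^2.$$

Let $\lambda_0, \lambda_1, \ldots, \lambda_d$ be the eigenvalues of $C$, and let
$\lambda$ be the largest of them. Then, since
$\operatorname{trace}(C) = \sum_{i=0}^{d} \lambda_i$ and $\lambda_i > 0$ for all $i$,

$$\lambda < \sum_{i=0}^{d} \lambda_i \le (d+1)\, r^2.$$

It then follows that if the adjustment parameter
$\beta \le \dfrac{2}{(d+1)\, r^2}$, then $\overline{a(n)}$ will converge to $a^*$,
since $C$ is always positive definite and

$$\beta \le \frac{2}{(d+1)\, r^2} \implies \beta < \frac{2}{\lambda}.$$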
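The chain of inequalities above can be illustrated numerically. The sketch below forms $C$ for a small non-uniform case, compares its trace with $(d+1) r^2$, and estimates the largest eigenvalue by power iteration (a standard technique brought in here only for the check; the report itself does not compute eigenvalues). The parameter values and function names are assumptions.

```python
from itertools import product

def second_moment_matrix(d, q, r, p):
    """C = sum_j p_j X^j X^jT over all vectors of B^d (assumes |q| <= |r|)."""
    x0 = max(abs(q), abs(r))
    vectors = [(x0,) + tail for tail in product((q, r), repeat=d)]
    n = d + 1
    C = [[0.0] * n for _ in range(n)]
    for pj, x in zip(p, vectors):
        for i in range(n):
            for k in range(n):
                C[i][k] += pj * x[i] * x[k]
    return C

def largest_eigenvalue(C, iterations=500):
    """Power iteration followed by a Rayleigh quotient (C symmetric, PSD)."""
    v = [1.0] * len(C)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in C]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    Cv = [sum(c * x for c, x in zip(row, v)) for row in C]
    return sum(x * y for x, y in zip(v, Cv))

d, q, r = 2, 0.0, 1.0
p = [0.1, 0.2, 0.3, 0.4]
C = second_moment_matrix(d, q, r, p)

trace = sum(C[i][i] for i in range(d + 1))
lam = largest_eigenvalue(C)
beta_bound = 2.0 / ((d + 1) * r * r)     # the simple bound derived above
```

Here the trace is 2.3, below $(d+1) r^2 = 3$, and the simple bound on $\beta$ is more conservative than the eigenvalue condition $2/\lambda$, as the derivation predicts.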


The remainder of this section examines the convergence of $\overline{a(n)}$
under the following assumptions.

(1) The components of the input vectors are of equal magnitude
and oppositely signed, i.e. $0 < -q = r$.

(2) The input sequence is uniform, i.e. $p_j = 2^{-d}$ for all $j$.

With these assumptions $C$ will be shown to reduce to $r^2 I$, which
simplifies further analysis without sacrificing a great deal of applicability,
since any PCM can be reworked to make the transformation of (1) above,
and (2) above is a common ad hoc approach to a practical situation.

Consider the following constructive scheme for the input vectors
which is analogous to the binary representation of the integers.

$$\begin{aligned}
X^1 &= (\, r, -r, -r, \ldots, -r, -r, -r \,)^T \\
X^2 &= (\, r, -r, -r, \ldots, -r, -r, \phantom{-}r \,)^T \\
X^3 &= (\, r, -r, -r, \ldots, -r, \phantom{-}r, -r \,)^T \\
X^4 &= (\, r, -r, -r, \ldots, -r, \phantom{-}r, \phantom{-}r \,)^T \\
&\ \ \vdots \\
X^{2^d} &= (\, r, \phantom{-}r, \phantom{-}r, \ldots, \phantom{-}r, \phantom{-}r, \phantom{-}r \,)^T
\end{aligned}$$

This scheme could also be displayed as the matrix $M = (m_{ij})$ with each
element of the form $m_{ij} = x_i^j / r$, where (contrary to the usual convention)
$i$ is the column index, which refers to the components of a particular
input vector, and $j$ is the row index, which refers to a particular
vector from $B^d$. The order of $M$ is $2^d \times (d+1)$. Column 0 consists
entirely of ones, and for $i = 1, 2, \ldots, d$ an analytic expression
for $m_{ij}$ is

$$m_{ij} = \begin{cases} -1, & \text{if } (j-1) \bmod 2^{d-i+1} < 2^{d-i}, \\ \phantom{-}1, & \text{otherwise.} \end{cases}$$


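$$M = \begin{array}{l|rrrcrrrr}
 & \text{col } 0 & \text{col } 1 & \text{col } 2 & \cdots & \text{col } d-3 & \text{col } d-2 & \text{col } d-1 & \text{col } d \\ \hline
\text{row } 2^0 & 1 & -1 & -1 & \cdots & -1 & -1 & -1 & -1 \\
\text{row } 2^1 & 1 & -1 & -1 & \cdots & -1 & -1 & -1 & 1 \\
\text{row } 2^1 + 1 & 1 & -1 & -1 & \cdots & -1 & -1 & 1 & -1 \\
\text{row } 2^2 & 1 & -1 & -1 & \cdots & -1 & -1 & 1 & 1 \\
\text{row } 2^2 + 1 & 1 & -1 & -1 & \cdots & -1 & 1 & -1 & -1 \\
\quad\vdots & & & & & & & & \\
\text{row } 2^{d-2} & 1 & -1 & -1 & \cdots & 1 & 1 & 1 & 1 \\
\text{row } 2^{d-2} + 1 & 1 & -1 & 1 & \cdots & -1 & -1 & -1 & -1 \\
\quad\vdots & & & & & & & & \\
\text{row } 2^{d-1} & 1 & -1 & 1 & \cdots & 1 & 1 & 1 & 1 \\
\text{row } 2^{d-1} + 1 & 1 & 1 & -1 & \cdots & -1 & -1 & -1 & -1 \\
\quad\vdots & & & & & & & & \\
\text{row } 2^{d} & 1 & 1 & 1 & \cdots & 1 & 1 & 1 & 1
\end{array}$$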


Now it is necessary to show that the column vectors of $M$ are
mutually orthogonal. Using the method of induction on the dimension of
the PCM, let $d = 1$. Then

$$M_1 = \begin{pmatrix} 1 & -1 \\ 1 & \phantom{-}1 \end{pmatrix} = (\, N_0\ N_1 \,),$$

where $N_0$ is the column vector $(1, 1)^T$ and $N_1$ is the column vector
$(-1, 1)^T$. There is only one pair of vectors to check:
$N_0^T N_1 = -1 + 1 = 0$. The vectors are orthogonal. Now let $d = \ell$
and $M_\ell = (\, C_0\ C_1\ C_2\ \cdots\ C_\ell \,)$, where $C_i$ is the $i$th column
vector of $M_\ell$. The induction assumption is that the column vectors of
$M_\ell$ are mutually orthogonal. Then if $K_i$ is the $i$th column vector for
the dimension $\ell + 1$,

$$M_{\ell+1} = (\, K_0\ K_1\ K_2\ \cdots\ K_{\ell+1} \,),$$

which can be partitioned with respect to rows as

$$M_{\ell+1} = \begin{pmatrix} C_0 & -C_0 & C_1 & C_2 & \cdots & C_\ell \\ C_0 & \phantom{-}C_0 & C_1 & C_2 & \cdots & C_\ell \end{pmatrix}.$$

With this partitioning any column vector of $M_{\ell+1}$ can be expressed as a
direct sum (a physical concatenation) of the column vectors of $M_\ell$, to
wit [5]: $K_0 = C_0 \oplus C_0$, $K_1 = (-C_0) \oplus C_0$, and
$K_i = C_{i-1} \oplus C_{i-1}$ for $i = 2, 3, \ldots, \ell+1$. Then any product of
the form $K_0^T K_i$ for $i = 2, 3, \ldots, \ell+1$ can be expressed as

$$(C_0 \oplus C_0)^T (C_{i-1} \oplus C_{i-1}) = C_0^T C_{i-1} + C_0^T C_{i-1} = 0 + 0 = 0,$$

with

$$K_0^T K_1 = (C_0 \oplus C_0)^T ((-C_0) \oplus C_0) = -C_0^T C_0 + C_0^T C_0 = 0.$$

Similarly, any product of the form $K_1^T K_i$ for $i = 2, 3, \ldots, \ell+1$ can be
expressed as

$$((-C_0) \oplus C_0)^T (C_{i-1} \oplus C_{i-1}) = -C_0^T C_{i-1} + C_0^T C_{i-1} = 0,$$

and any product of the form $K_i^T K_j$ for $1 < i < j \le \ell+1$ can be
expressed as

$$(C_{i-1} \oplus C_{i-1})^T (C_{j-1} \oplus C_{j-1}) = C_{i-1}^T C_{j-1} + C_{i-1}^T C_{j-1} = 2\, C_{i-1}^T C_{j-1} = 0,$$

since all the $C_i$'s are orthogonal.

Thus the column vectors of $M_{\ell+1}$ are mutually orthogonal, and by induction
it is true that for all positive integral values of $d$, the columns of the
matrix $M$ are mutually orthogonal.

Now consider the elements of $C = \sum_{k=1}^{2^d} p_k C^k$, where $p_k = 2^{-d}$
and the input components are $\pm r$.

$$\begin{aligned}
c_{ij} &= 2^{-d} \sum_{k=1}^{2^d} c_{ij}^k = 2^{-d} \sum_{k=1}^{2^d} \left( X^k X^{kT} \right)_{ij} \\
&= 2^{-d} \sum_{k=1}^{2^d} x_i^k x_j^k \\
&= 2^{-d}\, r^2 \sum_{k=1}^{2^d} m_{ik} m_{jk}, \qquad \text{since } m_{ij} = \frac{x_i^j}{r}.
\end{aligned}$$

This reduces to $c_{ij} = r^2 \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta,
since $\sum_{k=1}^{2^d} m_{ik} m_{jk} = 0$ if $i \ne j$, due to the orthogonality of the
column vectors of the matrix $M$, and equals $2^d$ if $i = j$. Hence $C = r^2 I$.
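The construction of $M$ and the reduction of $C$ to $r^2 I$ can be verified mechanically for a small dimension. The sketch below builds $M$ from the analytic expression for $m_{ij}$ (taking column 0 to be all ones, as in the listed vectors), forms the Gram matrix of its columns, and then forms $C$ under the uniform input sequence. The values $d = 3$ and $r = 2$ and the function names are assumptions for the example.

```python
def build_M(d):
    """Rows j = 1..2**d, columns i = 0..d, entries +/-1 as in the text."""
    rows = []
    for j in range(1, 2 ** d + 1):
        row = [1]                                  # column 0 is always +1
        for i in range(1, d + 1):
            low_half = (j - 1) % 2 ** (d - i + 1) < 2 ** (d - i)
            row.append(-1 if low_half else 1)
        rows.append(row)
    return rows

def gram(M):
    """Matrix of inner products between the columns of M."""
    cols = list(zip(*M))
    return [[sum(u * v for u, v in zip(ci, ck)) for ck in cols] for ci in cols]

d, r = 3, 2.0
M = build_M(d)
G = gram(M)

# Under the uniform input sequence, C = 2**-d * r**2 * (M^T M).
C = [[2.0 ** -d * r * r * g for g in row] for row in G]
```

Reading the rows of `M` top to bottom reproduces the binary-counting order of the vectors $X^1, X^2, \ldots, X^{2^d}$ with $-1$ in place of the digit 0.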

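Under these assumptions equation (6) becomes

$$\begin{aligned}
\overline{a(n)} &= [\, I - E \,]^{-1} [\, I - E^{n-1} \,]\, D + E^{n-1} a(1) \\
&= [\, \beta C \,]^{-1} [\, I - (I - \beta C)^{n-1} \,]\, D + [\, I - \beta C \,]^{n-1} a(1) \\
&= [\, \beta r^2 I \,]^{-1} [\, I - (I - \beta r^2 I)^{n-1} \,]\, D + [\, I - \beta r^2 I \,]^{n-1} a(1) \\
&= \frac{1}{\beta r^2} [\, 1 - (1 - \beta r^2)^{n-1} \,]\, \beta b + (1 - \beta r^2)^{n-1} a(1) \\
&= \frac{1}{r^2} [\, 1 - (1 - \beta r^2)^{n-1} \,]\, C a^* + (1 - \beta r^2)^{n-1} a(1) \\
&= [\, 1 - (1 - \beta r^2)^{n-1} \,]\, a^* + (1 - \beta r^2)^{n-1} a(1)
\end{aligned} \tag{9}$$

$$\phantom{\overline{a(n)}} = a^* - [\, a^* - a(1) \,] (1 - \beta r^2)^{n-1}. \tag{10}$$

Equation (1) can be written as $b = C a^*$, which becomes $b = r^2 a^*$, and
the optimal value of the gain vector is $a^* = b / r^2$. Then, writing
$(a^*)^2$ for $(a^*)^T a^*$,

$$\begin{aligned}
Q(a^*) &= \overline{f^2(X)} - 2 \sum_{\ell=0}^{d} a_\ell^*\, \overline{f(X)\, x_\ell} + \sum_{\ell=0}^{d} \sum_{k=0}^{d} a_\ell^* a_k^*\, \overline{x_\ell x_k} \\
&= 1 - 2 \sum_{\ell=0}^{d} a_\ell^*\, r^2 a_\ell^* + \sum_{\ell=0}^{d} \sum_{k=0}^{d} a_\ell^* a_k^*\, r^2 \delta_{\ell k} \\
&= 1 - r^2 (a^*)^2.
\end{aligned} \tag{11}$$

From equation (11) it is evident that if $Q(a^*)$, the optimum of the
minimization effort, is near zero, then $(a^*)^2$ is near $r^{-2}$. Note also the
range of $(a^*)^2$, between zero and $r^{-2}$.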
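Equations (10) and (11) can be checked against direct computation in a small case. The sketch below assumes $d = 3$, components $\pm 2$, a uniform input sequence, and a majority-vote environment; these choices, the initial gain vector, and the function names are illustrative assumptions. It iterates the expected-value recursion using the enumerated matrix $C$ and compares the result with the closed form, then compares $Q(a^*)$ computed by summation with $1 - r^2 (a^*)^2$.

```python
from itertools import product

d, r, beta = 3, 2.0, 0.05                  # beta < 2 / ((d+1) r^2) = 0.125
vectors = [(r,) + tail for tail in product((-r, r), repeat=d)]
p = 1.0 / len(vectors)                     # uniform input sequence

def f_majority(x):
    # Hypothetical environment: sign of the sum of the d inputs (d odd).
    return 1 if sum(x[1:]) > 0 else -1

n_comp = d + 1
b = [sum(p * f_majority(x) * x[i] for x in vectors) for i in range(n_comp)]
C = [[sum(p * x[i] * x[k] for x in vectors) for k in range(n_comp)]
     for i in range(n_comp)]
a_star = [bi / (r * r) for bi in b]        # a* = b / r^2

def mean_by_recursion(a1, n):
    """Expected gain vector after n-1 updates: a <- a + beta * (b - C a)."""
    a = list(a1)
    for _ in range(n - 1):
        Ca = [sum(c * ak for c, ak in zip(row, a)) for row in C]
        a = [ai + beta * (bi - cai) for ai, bi, cai in zip(a, b, Ca)]
    return a

def mean_closed_form(a1, n):
    """Equation (10): a* - (a* - a(1)) (1 - beta r^2)^(n-1)."""
    rho = (1.0 - beta * r * r) ** (n - 1)
    return [s - (s - a) * rho for s, a in zip(a_star, a1)]

a1 = [1.0, 0.0, 0.0, 0.0]
diff = max(abs(u - v) for u, v in
           zip(mean_by_recursion(a1, 10), mean_closed_form(a1, 10)))

Q_direct = sum(p * (f_majority(x) - sum(s * xi for s, xi in zip(a_star, x))) ** 2
               for x in vectors)
Q_formula = 1.0 - r * r * sum(s * s for s in a_star)
```

For this environment both evaluations of $Q(a^*)$ give 0.25, so the majority function is not fit exactly by the linear sum even though the thresholded output classifies it correctly.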
3. The Variance of the Gain Vector.

Consider the variance of $a(n)$, $V[a(n)] = \overline{a^2(n)} - \overline{a(n)}^{\,2}$,
where $a^2(n)$ denotes $a^T(n)\, a(n)$. It is necessary to compute
$\overline{a^2(n)}$, which will be done under assumption (1), the components
are $\pm r$. From equation (4),

$$a(n+1) = d_n + E_n a(n),$$

where $d_n = \beta b_n = \beta f[X(n)]\, X(n)$,

$E_n = I - \beta C_n = I - \beta X(n) X^T(n)$,

and let $0 < \beta < \dfrac{2}{(d+1)\, r^2}$.

Then

$$a^2(n+1) = [\, d_n^T + a^T(n) E_n^T \,][\, d_n + E_n a(n) \,] = d_n^2 + 2\, d_n^T E_n a(n) + a^T(n) E_n^T E_n a(n),$$

which can be written as

$$a^2(n+1) = \beta^2 r^2 (d+1) + 2\, [\, 1 - \beta r^2 (d+1) \,]\, d_n^T a(n) + a^T(n) \{\, I + \beta [\, \beta r^2 (d+1) - 2 \,]\, C_n \,\}\, a(n), \tag{12}$$

since

$$d_n^2 = d_n^T d_n = \beta^2 f[X(n)]\, X^T(n)\, f[X(n)]\, X(n) = \beta^2 X^T(n) X(n) = \beta^2 r^2 (d+1),$$

and

$$\begin{aligned}
d_n^T E_n &= d_n^T [\, I - \beta C_n \,] = \beta f[X(n)]\, X^T(n) [\, I - \beta X(n) X^T(n) \,] \\
&= \beta f[X(n)] [\, X^T(n) - \beta r^2 (d+1)\, X^T(n) \,] \\
&= [\, 1 - \beta r^2 (d+1) \,]\, d_n^T,
\end{aligned}$$

and

$$\begin{aligned}
E_n^T E_n &= I - 2\beta C_n + \beta^2 C_n^2 \\
&= I - 2\beta C_n + \beta^2 r^2 (d+1)\, C_n \\
&= I + \beta [\, \beta r^2 (d+1) - 2 \,]\, C_n.
\end{aligned}$$

Taking the expected value of equation (12),

$$\overline{a^2(n+1)} = \beta^2 r^2 (d+1) + 2\, [\, 1 - \beta r^2 (d+1) \,]\, \overline{d_n^T a(n)} + \overline{a^2(n)} + \beta [\, \beta r^2 (d+1) - 2 \,]\, \overline{a^T(n) C_n a(n)}. \tag{13}$$

Before considering the expected values of $d_n^T a(n)$ and $a^T(n) C_n a(n)$,
recall the following theorem, as stated by Halmos. [6]

If $\{ f_{ij} : i = 1, 2, \ldots, k;\ j = 1, 2, \ldots, n_i \}$ is a set of
independent functions, if $\varphi_i$ is a real valued, Borel measurable function of
$n_i$ real variables, $i = 1, 2, \ldots, k$, and if
$f_i(x) = \varphi_i(f_{i1}(x), \ldots, f_{i n_i}(x))$, then the functions
$f_1, \ldots, f_k$ are independent.

Now consider the set of independent random variables $\{ X(1), X(2), \ldots,
X(n-1), X(n) \}$. By definition $d_n = \beta f[X(n)]\, X(n)$ and is defined on the
subset $\{ X(n) \}$. On the other hand $a(n)$ is defined on the complementary
subset $\{ X(1), \ldots, X(n-1) \}$ by the recursion relation of equation (2).
Therefore, applying the theorem to each component of these vectors, the
conclusion is that $a(n)$ and $d_n$ are independent random variables. This
being the case, the expected value of $d_n^T a(n)$ is the product of the
expected values of $d_n$ and $a(n)$. Hence,

$$\overline{d_n^T a(n)} = \overline{d_n}^{\,T}\, \overline{a(n)} = \beta\, b^T\, \overline{a(n)}.$$

Then using $b = C a^*$ and equation (6),

$$\begin{aligned}
\overline{d_n^T a(n)} &= \beta (a^*)^T C \{\, (I - E)^{-1} (I - E^{n-1})\, D + E^{n-1} a \,\}, \qquad \text{where } a = a(1), \\
&= \beta (a^*)^T C \{\, [\, I - (I - \beta C)^{n-1} \,]\, a^* + (I - \beta C)^{n-1} a \,\} \\
&= \beta (a^*)^T C a^* - \beta (a^*)^T C\, (I - \beta C)^{n-1} (a^* - a).
\end{aligned} \tag{14}$$

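Next consider the expected value of $a^T(n) C_n a(n)$.

$$\begin{aligned}
\overline{a^T(n) C_n a(n)} &= \overline{\sum_{i=0}^{d} a_i(n) \sum_{j=0}^{d} x_i(n) x_j(n)\, a_j(n)} \\
&= \sum_{i=0}^{d} \sum_{j=0}^{d} \overline{a_i(n) a_j(n) x_i(n) x_j(n)} \\
&= \sum_{i=0}^{d} \sum_{j=0}^{d} \overline{a_i(n) a_j(n)}\ \overline{x_i(n) x_j(n)},
\end{aligned} \tag{15}$$

since $a(n)$ and $X(n)$ satisfy the hypothesis of the independence
theorem above.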


The goal now is to express this expected value as $\overline{a^2(n)}\, G(C)$, where $G$
is some unknown function of $C$. Equation (15) is close, but it contains
terms in $\overline{a_i(n) a_j(n)}$ with $i \ne j$, which have not yielded to analysis. Is it
possible that these cross product terms vanish? Or is it possible that
their coefficients $\overline{x_i(n) x_j(n)}$ vanish for $i \ne j$? The latter is, in fact,
exactly the conclusion arrived at by assumption (2), the uniform input
sequence. Under this assumption the derivation can proceed, and it will,
justified by the urgent need for some results, however special.

With $C = r^2 I$ used in the expected value terms, equation (15) becomes

$$\overline{a^T(n) C_n a(n)} = \sum_{i=0}^{d} r^2\, \overline{a_i^2(n)} = r^2\, \overline{a^2(n)},$$

and equation (14) becomes

$$\overline{d_n^T a(n)} = \beta r^2 (a^*)^2 - \beta r^2 (a^*)^T (a^* - a) (1 - \beta r^2)^{n-1}.$$
n 


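Finally returning to equation (13),

$$\begin{aligned}
\overline{a^2(n+1)} ={}& \beta^2 r^2 (d+1) + 2 \beta r^2 [\, 1 - \beta r^2 (d+1) \,] (a^*)^2 \\
& - 2 \beta r^2 [\, 1 - \beta r^2 (d+1) \,] (a^*)^T (a^* - a) (1 - \beta r^2)^{n-1} \\
& + [\, (1 - \beta r^2)^2 + \beta^2 r^4 d \,]\, \overline{a^2(n)}.
\end{aligned} \tag{16}$$

The structure of equation (16) is more readily apparent if the
following substitutions are made for the constants of the process.

Let
$\alpha = \beta r^2 \{\, \beta (d+1) + 2\, [\, 1 - \beta r^2 (d+1) \,] (a^*)^2 \,\}$,

$\gamma = -2 \beta r^2 [\, 1 - \beta r^2 (d+1) \,] (a^*)^T (a^* - a)$,

$\delta = (1 - \beta r^2)^2 + \beta^2 r^4 d$, and

$\rho = (1 - \beta r^2)$.

Then

$$\overline{a^2(n+1)} = \alpha + \gamma \rho^{n-1} + \delta\, \overline{a^2(n)}, \tag{17}$$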

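which when reworked recursively one step becomes

$$\begin{aligned}
\overline{a^2(n+1)} &= \alpha + \gamma \rho^{n-1} + \delta [\, \alpha + \gamma \rho^{n-2} + \delta\, \overline{a^2(n-1)} \,] \\
&= \alpha (1 + \delta) + \gamma (\rho + \delta) \rho^{n-2} + \delta^2\, \overline{a^2(n-1)},
\end{aligned}$$

and one more step,

$$\overline{a^2(n+1)} = \alpha (1 + \delta + \delta^2) + \gamma (\rho^2 + \rho \delta + \delta^2) \rho^{n-3} + \delta^3\, \overline{a^2(n-2)},$$

and through all the steps back to $a(1)$, is

$$\overline{a^2(n+1)} = \alpha \sum_{i=0}^{n-1} \delta^i + \gamma \sum_{i=0}^{n-1} \rho^{n-1-i} \delta^i + \delta^n a^2(1). \tag{18}$$

The geometric series may be put into closed form, yielding

$$\sum_{i=0}^{n-1} \delta^i = \frac{1 - \delta^n}{1 - \delta}, \qquad \text{and} \qquad \sum_{i=0}^{n-1} \rho^{n-1-i} \delta^i = \frac{\rho^n - \delta^n}{\rho - \delta}.$$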
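The two closed forms can be spot-checked numerically. The sketch below evaluates both sums term by term and compares them with the right-hand sides, using values of $\rho$ and $\delta$ that correspond to $\beta = 0.3$, $r = 1$, $d = 2$; these parameter choices and the function name are assumptions for the example.

```python
def geometric_sums(rho, delta, n):
    """Evaluate both sums from equation (18) term by term."""
    s_delta = sum(delta ** i for i in range(n))
    s_mixed = sum(rho ** (n - 1 - i) * delta ** i for i in range(n))
    return s_delta, s_mixed

beta, r, d = 0.3, 1.0, 2
rho = 1.0 - beta * r * r                               # 0.7
delta = (1.0 - beta * r * r) ** 2 + beta ** 2 * r ** 4 * d   # 0.67

n = 12
s_delta, s_mixed = geometric_sums(rho, delta, n)
closed_delta = (1.0 - delta ** n) / (1.0 - delta)
closed_mixed = (rho ** n - delta ** n) / (rho - delta)
```

The second closed form requires $\rho \ne \delta$, that is $\beta r^2 (d+1) \ne 1$; at that single value of $\beta$ the mixed sum is simply $n \rho^{n-1}$.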

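Now replacing $n$ by $n-1$, and making the above substitutions in equation
(18),

$$\overline{a^2(n)} = \alpha\, \frac{1 - \delta^{n-1}}{1 - \delta} + \gamma\, \frac{\rho^{n-1} - \delta^{n-1}}{\rho - \delta} + \delta^{n-1} a^2. \tag{19}$$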

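The denominators in equation (19) can be rewritten as follows.

$1 - \delta = \beta r^2 [\, 2 - \beta r^2 (d+1) \,]$, and

$\rho - \delta = \beta r^2 [\, 1 - \beta r^2 (d+1) \,]$.

Now making all the substitutions for $\alpha$, $\gamma$, and $\rho$, and cancelling the
new forms of the denominators,

$$\begin{aligned}
\overline{a^2(n)} ={}& \{\, \beta (d+1) + 2\, [\, 1 - \beta r^2 (d+1) \,] (a^*)^2 \,\}\, \frac{1 - \delta^{n-1}}{2 - \beta r^2 (d+1)} \\
& - 2 (a^*)^T (a^* - a) \{\, (1 - \beta r^2)^{n-1} - \delta^{n-1} \,\} + \delta^{n-1} a^2.
\end{aligned} \tag{20}$$

Collecting all the terms in $\delta$ and $(1 - \beta r^2)$,

$$\begin{aligned}
\overline{a^2(n)} ={}& \frac{\beta (d+1) + 2\, [\, 1 - \beta r^2 (d+1) \,] (a^*)^2}{2 - \beta r^2 (d+1)} \\
& + \frac{2 (a^* - a)^2 + \beta (d+1) [\, r^2 (2 (a^*)^T - a^T)\, a - 1 \,]}{2 - \beta r^2 (d+1)}\ \delta^{n-1} \\
& - 2 (a^*)^T (a^* - a) (1 - \beta r^2)^{n-1}.
\end{aligned} \tag{21}$$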

Equation (10) provides $\overline{a(n)}$ under assumptions (1) and (2), which
when squared yields

$$\overline{a(n)}^{\,2} = (a^*)^2 - 2 (a^*)^T (a^* - a) (1 - \beta r^2)^{n-1} + (a^* - a)^2 [\, (1 - \beta r^2)^2 \,]^{n-1}. \tag{22}$$

Combining equations (21) and (22), the variance of the gain vector is

$$\begin{aligned}
V[a(n)] ={}& \overline{a^2(n)} - \overline{a(n)}^{\,2} \\
={}& \frac{\beta (d+1) + 2\, [\, 1 - \beta r^2 (d+1) \,] (a^*)^2}{2 - \beta r^2 (d+1)} \\
& + \frac{2 (a^* - a)^2 + \beta (d+1) [\, r^2 (2 (a^*)^T - a^T)\, a - 1 \,]}{2 - \beta r^2 (d+1)}\ \delta^{n-1} \\
& - (a^*)^2 - (a^* - a)^2 [\, (1 - \beta r^2)^2 \,]^{n-1}.
\end{aligned} \tag{23}$$

Combining the constant terms leaves

$$\begin{aligned}
V[a(n)] ={}& \frac{\beta (d+1) [\, 1 - r^2 (a^*)^2 \,]}{2 - \beta r^2 (d+1)} \\
& + \frac{2 (a^* - a)^2 + \beta (d+1) [\, r^2 (2 (a^*)^T - a^T)\, a - 1 \,]}{2 - \beta r^2 (d+1)}\ \delta^{n-1} \\
& - (a^* - a)^2 [\, (1 - \beta r^2)^2 \,]^{n-1},
\end{aligned} \tag{24}$$

where $\delta = (1 - \beta r^2)^2 + \beta^2 r^4 d$.

Now that the variance of the gain vector has been derived, the
question is whether or not it converges, and if so, to what, and under
what conditions? The bounds on $\beta$ which imply convergence of the mean
are zero and $2 / r^2$ (under both assumptions). Hence the bounds on
$(1 - \beta r^2)^2$ are zero and one. Therefore
$\lim_{n \to \infty} [\, (1 - \beta r^2)^2 \,]^{n-1} = 0$. On
the other hand $\delta$ is a quadratic expression in $\beta$ which is less than one for
$\beta$ between zero and $\dfrac{2}{(d+1)\, r^2}$, and therefore this lesser bound must be
observed in order for the term in $\delta$ to vanish, and consequently ensure
the convergence of the variance of the gain vector. Hence, if both
assumptions (1) and (2) are valid, and if $0 < \beta < \dfrac{2}{(d+1)\, r^2}$, then both
$\overline{a(n)}$ and $V[a(n)]$ will converge.

The mean will converge to $a^*$, that $a$ which minimizes $Q(a)$, and the
variance to

$$V = \lim_{n \to \infty} V[a(n)] = \frac{\beta (d+1) [\, 1 - r^2 (a^*)^2 \,]}{2 - \beta r^2 (d+1)}. \tag{25}$$

Now recall the relationship from equation (11) between $a^*$ and $Q(a^*)$
which enables equation (25) to be written as

$$V = \frac{\beta (d+1)\, Q(a^*)}{2 - \beta r^2 (d+1)},$$

where the bounds on $Q$ are zero and one, and $Q$ can be interpreted as a
measure of the complexity of the environment to which the PCM is exposed.
Since the derivative of $V$ with respect to $\beta$ is

$$\frac{2 (d+1)\, Q(a^*)}{[\, 2 - \beta r^2 (d+1) \,]^2},$$

which is non negative, and since $V$ vanishes
for $\beta = 0$, the smaller $\beta$, the smaller $V$.
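Because the input space is finite, the variance can be computed exactly for small $n$ by enumerating every possible input sequence, which gives an independent check on equation (24) and on the limit (25). The sketch below does this for $d = 2$, components $\pm 1$, a uniform input sequence, and a logical AND environment; these choices, the initial gain vector, and the function names are assumptions for the example.

```python
from itertools import product

d, r, beta = 2, 1.0, 0.3                   # beta < 2 / ((d+1) r^2) = 0.667
vectors = [(r,) + tail for tail in product((-r, r), repeat=d)]

def f_and(x):
    # Hypothetical environment: logical AND of the inputs (x0 is ignored).
    return 1 if all(xi > 0 for xi in x[1:]) else -1

# a* = b / r^2, with b the mean of f(X) X over the uniform input sequence.
a_star = [sum(f_and(x) * x[i] for x in vectors) / len(vectors) / (r * r)
          for i in range(d + 1)]
a1 = [0.5, -0.25, 0.0]                     # arbitrary initial setting a(1)

def variance_by_enumeration(n):
    """Exact V[a(n)] over all equally likely sequences X(1), ..., X(n-1)."""
    total_sq, mean, count = 0.0, [0.0] * (d + 1), 0
    for seq in product(vectors, repeat=n - 1):
        a = list(a1)
        for x in seq:                      # equation (2) with beta = 2 * eta
            err = f_and(x) - sum(ai * xi for ai, xi in zip(a, x))
            a = [ai + beta * xi * err for ai, xi in zip(a, x)]
        total_sq += sum(ai * ai for ai in a)
        mean = [m + ai for m, ai in zip(mean, a)]
        count += 1
    mean = [m / count for m in mean]
    return total_sq / count - sum(m * m for m in mean)

def variance_closed_form(n):
    """Equation (24)."""
    s = sum(v * v for v in a_star)                           # (a*)^2
    gap = sum((v - a) ** 2 for v, a in zip(a_star, a1))      # (a* - a)^2
    cross = sum((2 * v - a) * a for v, a in zip(a_star, a1)) # (2a* - a)^T a
    denom = 2.0 - beta * r * r * (d + 1)
    delta = (1.0 - beta * r * r) ** 2 + beta ** 2 * r ** 4 * d
    rho_sq = (1.0 - beta * r * r) ** 2
    return (beta * (d + 1) * (1.0 - r * r * s) / denom
            + (2 * gap + beta * (d + 1) * (r * r * cross - 1.0)) / denom
            * delta ** (n - 1)
            - gap * rho_sq ** (n - 1))

max_diff = max(abs(variance_by_enumeration(n) - variance_closed_form(n))
               for n in range(1, 6))
Q_star = 1.0 - r * r * sum(v * v for v in a_star)
V_limit = beta * (d + 1) * Q_star / (2.0 - beta * r * r * (d + 1))
```

The enumeration grows as $2^{d(n-1)}$, so it is only practical for very small cases; its purpose here is to confirm the closed form, after which equation (24) can be evaluated for any $n$.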

Exactly how small $V$ must be in order for the state of the PCM to be
minimized is an open question. Intuitively, it is felt that $S(a)$ reaches
its minimum before (in the input sequence) $Q(a)$ is minimized, and the
answer is probably also the answer to Theorem 1. Further work on this
problem is indicated. Also indicated, of course, is the need to generalize
this work to apply to any input sequence. Once these details are in order, it
should be possible to obtain an analytic expression for the number (or
average number) of trials to achieve a minimum for the state of the PCM.
Other algorithms and configurations which may yield to analysis can be
found in Nilsson. [1]

In the appendix are the results of evaluating the variance of the
gain vector for some representative values of the parameters.
APPENDIX I

Assumptions: (1) The input components are $\pm 1$.

(2) The input sequence is uniform.

(3) $0 < \beta < \dfrac{2}{d+1}$.

Then $V = \dfrac{\beta (d+1)\, Q(a^*)}{2 - \beta (d+1)}$.

Let $\beta = \dfrac{Z}{d+1}$; then $0 < Z < 2$ and $V = \dfrac{Z Q}{2 - Z}$, where $Q = Q(a^*)$.

Note that $0 \le Q \le 1$ and observe Fig. 1. Figures 2 and 3 are
graphs of the variance when $Q = 0$ and therefore $V = 0$ for all values of
$Z$.

Figures  4  and  5  are  graphs  of  the  variance  when  Q  =  ^  and  V  depends 
on  Z. 
Figure 1. $V = \dfrac{Z Q}{2 - Z}$, plotted for $Q = 1.00$, $Q = .75$, and $Q = .50$.